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in the Movement of Molecular Machines 


Joachim Rosskopf^ Korbinian Paul-Yuan^ Martin B. Plenio^, Jens Michaelis^ 

^ Institute of Theoretical Physics Ulm University, ^ Institute of Biophysics Ulm University 

Analyzing the physical and chemical properties of single DNA based molecular machines such 
as polymerases and helicases requires to track stepping motion on the length scale of base pairs. 
Although high resolution instruments have been developed that are capable of reaching that limit, 
individual steps are oftentimes hidden by experimental noise which complicates data processing. 
Here, we present an effective two-step algorithm which detects steps in a high bandwidth signal 
by minimizing an energy based model (Energy based step-finder, EBS). Eirst, an efficient convex 
denoising scheme is applied which allows compression to tuples of amplitudes and plateau lengths. 
Second, a combinatorial clustering algorithm formulated on a graph is used to assign steps to the 
tuple data while accounting for prior information. 

Performance of the algorithm was tested on Poissonian stepping data simulated based on published 
kinetics data of RNA Polymerase II (Pol II). Comparison to existing step-finding methods shows that 
EBS is superior in speed while providing competitive step detection results especially in challenging 
situations. 

Moreover, the capability to detect backtracked intervals in experimental data of Pol II as well as 
to detect stepping behavior of the Phi29 DNA packaging motor is demonstrated. 

^ The authors contributed equally to this article. 


Introduction 

Single molecule measurements of molecular motors al¬ 
low to study the motion of individual enzymes. The stud¬ 
ies range from enzymes making comparably large steps 
e.g. motor proteins like Myosin V [1] and Kinesin [2] to 
DNA based molecular machines which make steps on the 
scale of single nucleotides |3H6]. Experimental techniques 
to study these systems range from single molecule fluo¬ 
rescence localization [7] to optical and magnetic tweezers 
[8]. Most of these measurements represent the underly¬ 
ing dynamics as one-dimensional time series of positional 
changes. The enzymatic reactions which fuel this motion 
appear as stochastic events resulting in step-like move¬ 
ments [9] obliterated by noise. Nowadays state of the art 
optical tweezers experiments allow to study the move¬ 
ment of enzymes with a resolution down to single base 
pairs 13 HO]. For example, studies on the ip29 bacte¬ 
riophage ring ATPase pTHIS) used the information from 
step detection data to propose a complete model of the 
mechanochemical cycle. However, oftentimes analysis 
schemes rely on low pass smoothed data. 

Indeed, the problem of finding steps is not only lim¬ 
ited to studies of movement of enzymes but appears in 
a wide range of biomolecular experiments from fluores¬ 
cence resonance energy transfer trajectories m, to steps 
in membrane tether formation m, or the opening of ion 
channels [16], just to name a few. 

Consequently, there is a rich amount of signal process¬ 
ing techniques available to recover piecewise constant sig¬ 
nals from noisy data. Due to the stochastic nature of en¬ 
zymatic stepping the number of steps is often not known 
a priori. Therefore, different step finding algorithms have 
been developed [HHlo]. 

One class of algorithms determine steps from single 
molecule data based on statistical hypothesis testing in 


a moving window. A prominent example is the so called 
t-test, which is based on the Student’s t-test m- In this 
algorithm a step is recorded when the hypothesis that 
two normally distributed random variables have the same 
mean is violated. The mean is calculated with respect to 
a certain time window which is an input parameter that 
can be eliminated by sweeping through various window 
sizes. Thus, the t-test is conceptionally simple. How¬ 
ever, for situations with small step-sizes and as a result 
comparatively large noise, increasing window sizes are 
required, limiting efficient step-detection. 

Hidden Markov models (HMM) have been developed 
for situations with poor signal-to-noise ratio (snug. In 
HMM the signal is modelled as a Markov process with 
transitions between discrete states obliterated by Gaus¬ 
sian white noise. Thus, in the HMM analysis of step¬ 
ping data, transition probabilities of a Markov process 
are obtained from a maximum likelihood estimation and 
the steps are reconstructed using the Viterbi algorithm 
[23] , A HMM for processive molecular motor data re¬ 
quires many states to model the possible positions on 
the template, making it computationally expensive. Per¬ 
formance can be improved by cutting the signal at a pre¬ 
defined amplitude and transforming positions to periodic 
coordinates to limit the necessary number of states [22|. 
HMMs proved to be excellent tools for pattern recogni¬ 
tion in many fields. However, in addition to being com¬ 
putationally demanding they rely on assumptions about 
the hidden stepping process and about the noise model. 

Another popular class of step-finding algorithms re¬ 
construct the underlying step signal by successively in¬ 
troducing new steps until a stop-criterion is met [24l [26] . 
One commonly used approach is developed by Kalafut 
and Visscher (K & V). It positions every new step such 
that the Bayesian Information Criterion with respect to 
the noisy data is minimized [24]. It is a topic of current 
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research, if this is a valid assumption for change-point 
problems [25] . The algorithm does not require user input 
and stops when the addition of new steps is unfavorable 
according to the Bayesian Information Criterion. 

The K & V algorithm is a member of the larger class 
of step-finding methods which minimize a certain energy 
function m- However, since the K & V algorithm only 
adds new steps and does not remove previously found 
steps, it is not guaranteed that the global energy mini¬ 
mum is found m- Finding the global minimum is pos¬ 
sible, if these energy functions are convex. In this case, 
efficient algorithms can be used that yield good approx¬ 
imations to the underlying step signal m- However, for 
poor signal to noise ratio, these convex energy functions 
are too simplistic to optimally detect steps, resulting in 
an overfitting of the data, i.e. more steps are detected 
than are actually present. Thus, if steps are hidden in 
noise such algorithms behave as efficient filter-functions, 
and accurate step-detection requires an additional second 
stage on the filtered, i.e. denoised data |2Q] . 

Here, we present a novel two stage approach, termed 
Energy Based Stepfinding (EBS), where both stages are 
based on the minimisation of energy functions. In a first 
stage, we denoise the signal with a highly efficient and 
fast optimization algorithm. The algorithm minimizes a 
convex energy function in a process called total variation 
denoising (TVDN). We show that an optimal denoising 
can be found making the process effectively parameter- 
free. There is no further assumption about the noise 
necessary. Eor actual step detection, we proceed in the 
second stage of EBS with Combinatorial Clustering (CC) 
of the denoised data into steps. Such an approach is al¬ 
ready in common use in the computer vision community 
[28] and is both computationally efficient and fast. The 
energy functions used in CC belong to a more general 
class which allows the incorporation of prior knowledge 
such as the step size of the stepper to make the algorithm 
more accurate. We tested EBS with simulated data that 
were created based on experimental data of RNA poly¬ 
merase H (Pol H) movement. We compare the perfor¬ 
mance of EBS on the same simulated data to (i) a t-test, 
(ii) to the variable stepsize HMM and (hi) to the K & V 
algorithm. 

The analysis reveals that EBS performs faster and 
more accurately. We therefore applied the algorithm to 
detect steps in experimental data of the bacteriophage 
ip29 packaging motor and to determine pauses of Pol H 
transcription elongation in high resolution optical tweez¬ 
ers experiments. 


Methods 

Energy Based Model 

Starting from a large and noisy trajectory of motor 
protein movement we use an energy based step detection 
(EBS). To reveal the steps produced by the underlying 


biological system hidden in noise, one has to identify 
piecewise constant parts in the data set. This is done 
by taking the Welement noisy input data and creating 
an Welement output set of steps. Therefore one needs 
to penalize variations within neighboring variables in the 
signal. In contrast one needs to increase the energy if 
the free variables deviate too much from the measured 
signal. This is reflected in the energy function 

N N 

E{x,y) = '^D{\xi - Vil) + (1) 
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where y and x are the A/'-element input data and out¬ 
put variable vectors respectively. Minimizing the energy 
function is the conceptual baseline of our approach. It 
consists of terms where variables interact with the input 
data T)(-), as well as nearest neighbor interaction between 
two adjacent variables S{'). Unfortunately, depending on 
the actual shape of these terms the optimization prob¬ 
lem can get prohibitively computationally expensive [28| . 
One of the design goals of EBS was to work efficiently for 
large data sets on commodity hardware. Therefore, we 
chose an approach which, in the first stage denoises and 
smoothens the signal by minimizing a simple convex en¬ 
ergy function, solving the TVDN problem. The result 
of this stage is the set of denoised steps. Each step is 
characterized by its amplitude and length. We call the 
combination of amplitude and length from now on a tu¬ 
ple. The amount of tuples remains comparably low even 
for a significantly increased sampling rate. This makes 
EBS well suited for high bandwidth data consisting of 
a huge number of datapoints. Afterwards we use this 
smaller set of tuples and minimize a more sophisticated 
energy function in the CC stage, which is defined on a 
discrete ievei set and incorporates a step height prior. A 
flowchart of this two stage process is shown in figure 

Total Variation Denoising 

In the first stage of EBS, we separate noise from the 
actuai stepping signai. This stage works on the fuii and 
noisy ID input data set y = ^i, ...,^Ar G which can 
be quite iarge {N > 10^ datapoints). To denoise we 
minimize an energy function known as Totai Variation 
Denoising (TVDN) probiem [29] . 

x'^ = argmin q{x^ y) + p{x) 

^ N N-1 (2) 

= argmin - V \xi - + A W ja;* 

where the optimai soiution x'^ represents the denoised 
signai. The {yi} and {xi} are the i-th entry of the time- 
discrete input and soiution vector, respectiveiy. This 
optimai soiution is a tradeoff between prior knowiedge 
that the enzymatic steps yieid piecewise constant signais, 
which is introduced by p{x) = A J]] a function 
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FIG. 1: Flowchart of the two stage process for finding steps in noisy stepping data. The input data is first denoised 
via solving the convex TVDN problem. This process requires no intervention, as the regularization parameter A is 
determined automatically. This results in a lower dimensional, compressed representation of tuples (amplitude, 
length). This discrete data is then handed to a Graph Cut algorithm which solves a combinatorial clustering 
problem on a graph. The Graph Cut allows further customization by the use of regularization parameters pi and a 

pre-defined level set. 


which penalizes introducing steps. On the other hand 
the term q{x^ 2/) = ^ ~ penalizes deviations of 

the resulting solution from the input signal. The regu¬ 
larization parameter A is important for the solution 
and controls the relative weight of the two terms. The 
unique solution of this problem requires no assumption 
about the characteristics of the noise. Therefore the de- 
noising step works well in case of Gaussian white noise 
as well as more complicated colored noise. 

The energy function in Eq. (§ is strictly convex, which 
means regardless of the input data y there exists one 
unique solution x^ (see e.g. m)- We have applied a 
fast algorithm for solving the TVDN problem (appendix) 
which can easily handle millions of data points in a few 
milliseconds m- The algorithm scans forward through 
the signal. During this it tries to extend segments of the 
signal with the same amplitude, until optimality condi¬ 
tions derived from the TVDN problem are violated. If 
this happens the method backtracks to a position where 
a new step can be introduced, re-validates the current 
segment until this position and starts a new segment (ap¬ 
pendix). 

An open problem in the context of TVDN for step 
detection is how to choose the regularization parameter 
A such that as few as possible true steps are lost (false 
negatives) but still the data is not overfitted (false posi¬ 
tives). We propose a heuristic method to choose an opti¬ 
mal value for A, termed A/^, automatically. To motivate 
these heuristics we have a closer look to the two limits 
naturally imposed to A. For Xmin = 0 fhe TVDN al¬ 
gorithm perfectly reproduces the input signal such that 
x^ = y. On the other hand, the upper bound of sensi¬ 
ble values is marked by A = Xmax- Above this threshold 
the solution of Eq. (§ is constant x'^ = const for all i. 
The value of Xmax can be derived analytically from the 
underlying Eenchel-Rockafellar [32] problem (appendix). 

There exists a transition in TVDN while varying the 
regularization parameter from a stable minimization into 
the over fitting regime. Thus, by lowering A from Xmax fo 
Xmin one observes a sudden increase of steps produced by 



^max 


EIG. 2: Breakdown of the TVDN model dependent on 
X/Xmax- The number of produced steps and plateaus 
vs. different A. Eor small X/Xmax solving the TVDN 
problem reproduces the input signal and the number of 
steps equals the number of data points. Eor big X/Xmax 
the number of steps is significantly lower. The point 
Xh/Xmax (magenta) before the number of steps 
increases suddenly marks the value of the TVDN 
regularization parameter that we choose in our 
heuristic. Plotted is a constant signal with added 
Gaussian white noise (red), a signal with exponentially 
distributed dwell times and Gaussian white noise (blue), 
and the same signal without white noise (cyan). 


TVDN (figure . This marks the point when the TVDN 
minimization breaks down and the solution starts to fit 
noise. The breakdown also persists while varying sam¬ 
pling frequency or rate of steps as well as signal to noise 
ratio (appendix). To choose the optimal value of A, A/^, 
we use a line-search algorithm which detects the sudden 
increase in the slope of the number of steps in the result¬ 
ing signal and uses a slightly larger value. The sole input 
to this algorithm is the analytically determined value of 
Xmax- Therefore the A/i-heuristic provides us with a sta¬ 
ble parameter-free means to choose an optimized TVDN 
regularization parameter. 
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Performance Characteristics of the Total Variation 
Denoising Algorithm 

Our implementation of ID total variation denoising is 
based on the C code published together with [31]. This 
publication also provides a detailed outline of the TVDN 
algorithm, describtion of its working principles as well as 
the optimality condition it adheres to. 

Typically TVDN is addressed by fixed-point methods 
[54] . These methods reach the minimal theoretically pos¬ 
sible algorithmic complexity m- A different kind of ap¬ 
proach [31] uses the local nature of the total variation de¬ 
noising filter and provides a very fast, memory efficient, 
non-iterative way to solve Eq. <§• Although the the¬ 
oretical complexity of this algorithm is worse compared 
to fixed-point methods it actually achieves competitive 
or even faster results on signals which exhibit piecewise 
constant characteristics. For practical situation the com¬ 
plexity class of the algorithm can be assumed to be 0{N). 
Thus for such signals denoising of 10^ datapoints takes 
around 30ms on a recent 2.bGHz processor. 

After successful TVDN of the signal x'^ G consists 
of M steps. A step is characterized by a discontinuity 
between two neighboring plateaus with different ampli¬ 
tude a. It is beneficial to represent the signal not in the 
basis of indexed amplitudes x*, but instead to use tu¬ 
ples (a,u;)j, jf G 1, ...,M. Where aj is the amplitude and 
Wj is the length of the j-th plateau. By this change of 
representation the number of elements of the data set 
is typically reduced from several millions to a few thou¬ 
sand. This increases computational efficiency due to the 
fact that the complexity of following algorithms depends 
on the number of elements in the data set. Therefore, a 
compressed signal consisting of tuples opens up the pos¬ 
sibility to apply sophisticated step-detection algorithms 
on the data. The problem can now be cast as a Markov 
Random Field [33] and can be tackled by a CC method 
as will be presented in the following section. 


Graph cut and o-expansion used for combinatorial 
clustering to minimize energy functions 

As stated above, the input to the second stage of EBS 
are tuples of amplitude and corresponding length (a, re) 
of the compressed signal. To reveal the actual steps, these 
tuples have to be clustered on a discrete set of levels 
by minimizing an energy function. This means that a 
combinatorial version of an energy function similar to 
Eq. 0 has to be optimized. The length of a plateau plays 
the role of a weighting factor changing the contribution 
of a single tuple or a pair of tuples to the total energy. 
With these modifications a general energy loss function 


takes the form, 

E{^\{a,w))= y] 

iev 

^ ^ ^ 2,2 + 1 (^2 5 ^i + 1 1^2 5 + '^2 + 1 ) 

( 3 ) 

where the possible are taken from a set of levels C. 
The value of the data term Qi{') depends on deviations 
of from the input. The pairwise term •) encodes 

interaction potentials between neighboring plateaus. Es¬ 
sentially the problem means to cluster the tuples (a, w) 
to discrete levels, such that the joint configuration ^ min¬ 
imizes E(^). 

An elegant solution can be found by mapping the prob¬ 
lem onto a graph Q = (V, f), consisting of vertices V and 
edges £. For the simple binary case, where the tuples 
have to be assigned to only two levels, termed source s 
and terminal t, both of these levels as well as all tuples 
represent vertices V. £ denotes the set of edges con¬ 
necting the vertices (figure [3a| ) and each edge carries a 
capacity q > 0 (figure [31^. Therefore there are two types 
of edges, those connecting neighboring tuples and those 
edges connecting tuples to levels. The capacities of the 
former are encoded in the pairwise term •) and the 

latter are represented by the data term In the 

process of assigning a level to tuples the Graph Cut 
algorithm solves the following binary decision problem: 
Is the assignment to level t more favorable than assign¬ 
ment to level s in terms of the energy function? In the 
graphical representation this assignment is represented 
by a cut through edges of neighboring tuples and edges 
between tuples and the s and t level (figure . 

Due to the well known min cut/max flow theorem of 
graph theory the optimal energy coincides with the small¬ 
est sum of capacities of the edges one has to cut from the 
graph to disconnect s from t [35] . The cut splits the graph 
Q in two subgraphs: The part S which is connected to the 
vertex s and the part T which is connected to t. The al¬ 
gorithm we apply solves this problem in polynomial time 
(appendix). 

To make min cut/max flow useable for the above de¬ 
scribed assignment of multiple different levels it has 
to be embedded into an outer procedure. For this we 
use the ^-expansion algorithm [28] [36] . It finds provably 
good approximate solutions by iteratively solving Graph 
Cut problems on graphs representing the binary decision 
whether to alter the previous assigned level configura¬ 
tion or not [28] . For a multi level problem new levels are 
added successively in a random order. That means, once 
the graph has been optimized for i levels and the new 
i + Ith level is introduced, t corresponds to the assign¬ 
ment to the predefined level set and s to the new level. 
Again capacities for all edges are computed. With the 
new graph cut, vertices in the subgraph S get assigned 
their new level, the other vertices connected to T keep 
their previously assigned level. After having introduced 
all levels, in order to minimize the energy even more, the 
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FIG. 3: Cartoon representation of the graph cut algorithm. 3a) the initial graph structure we use to model the step 
signal. The nodes i G {1... M} represent the variables ^i. 3b) in a second step, energies are mapped to capacities of 

finds the max flow and cuts the graph into two 


edges. 3c) the Boykov-Kolmogorov Graph Cut algorithm 

subgraphs Q = S^JT where S is the part connected to s and T is the remaining part connected to t 


assignment can be optimized by iteratively reintroduc¬ 
ing the complete level set. This iteration stops when the 
overall energy is not decreasing anymore (appendix). 

In general finding the level configuration which coin¬ 
cides with minimal energy requires at least nondetermin- 
istic polynomial time. Graph cut algorithms provide the 
advantage to solve the problem in polynomial time, with 
the constraint to be just applicable to energies which ex¬ 
hibit a strong local minimum m This is the case if the 
pairwise terms Vij of the energy function satisfy 

'Pij 7) + 'Pij {o:,a) < Vij {j3, a) + Vij {a, 7 ) (4) 

for arbitrary levels tr,/d ,7 G £. This is also known as 
submodularity or regularity condition. 

Energy function for step-finding 

To perform CC we have to specify the energy function, 
Eq.® as well as the level grid. The levels can be cho¬ 
sen arbitrarily, and depending on the problem, provide an 
elegant way to introduce prior information. Often molec¬ 
ular motors move in discrete steps with known step-size. 
In this case, the spacing of the level grid can be chosen 
to match the known step-size. If such information is not 
known a priori or steps are expected to be nonuniform 
the levels have to be chosen with a refinement that cor¬ 
responds to the required numerical accuracy, i.e. with a 
sufficiently small spacing. 

To determine the relative importance of the terms Q 
and V in equation we introduce the parameters 
ps and pp which regularize the detected steps. 

The data terms Qi penalizes deviations of the proposed 
level amplitude to the original tuple amplitude at 


the vertex Vi 

Qi = PD ■ U>i\^i - Clil (5) 

where is a regularization parameter determining the 
importance of the data term, and Wi the weight of the 
current tuple. The most prominent plateaus are likely to 
be discovered by TVDN and contribute a tuple with a 
large weight. Thus, in order to preserve these plateaus 
the data term also depends on the weight Wi. 

For the case of an equidistant level set Vij consists of 
two different terms, a smoothing term and a term that 
favors steps of a certain size. The first and simpler pair¬ 
wise energy uses a Potts Model [38] to increase the energy 
whenever two assigned levels and differ 

7°+r = PS ■ K + Wi+i){l - 5{ii,£,i+i)) ( 6 ) 

where S{x^y) = 1 if x = y and S{x,y) = 0 else. Here ps 
is the smoothing parameter determining the energetic 
penalty for differing adjacent levels. The Potts model 
satisfies the submodularity condition, Eq. @,IS7]. A 
larger regularization parameter ps boosts clustering of 
the signal and therefore combines steps. There is no 
other a-priori bias towards combining steps due to the 
GG algorithm itself. 

The second more sophisticated contribution to the 
pairwise term in Eq.(|^ favors level changes of specific 
size between adjacent sites. 

This second pairwise term is optional if step sizes are 
uniform and it serves the purpose to introduce that prior 
information. Lowering the regularization parameter pp 
gives rise to the introduction of new steps with a special 
step height. The complete pairwise term Vi^j thus in¬ 
cludes prior information about step heights and is given 
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by 

pp{l - (5(^i,^i+i))(l - (5(|^i - ^i+i|,e)), 

with an expected average step height e determined by the 
underlying process. The depth of the jump height prior 
potential is given by the jump height parameter pp. In 
contrary to Eq. we chose this term to not depend on 
the weights of the adjacent sites to regularize step sizes 
independently of the corresponding dwell times. 

Note that not all pairwise terms constructed by Eq. 
Q strictly fulfill the submodularity condition, Eq. Q. 
Therefore we applied an extension to the graph construc¬ 
tion procedure proposed in [39] to circumvent a submod¬ 
ularity violation (appendix). The procedure truncates 
the energy until it satisfies 0. The procedure is applica¬ 
ble to any energy function and provides a provably good 
approximation for a single expansion move. Eor the com¬ 
plete Qf-expansion the procedure is applicable if most of 
the terms are submodular [39] . This is fulfilled by all sig¬ 
nals presented below: mean fraction of non-submodular 
terms (0.27 ± 0.08)%. 


Simulation Method and Definition of Parameters for 
Algorithm Comparison 

In order to quantify positional and temporal accuracy 
of the steps detected by EBS we use simulated data of 
noisy steps which are generated in a two stage process. 
In a first stage we generate a piecewise constant signal 
according to a simplified Pol II stepping model where a 
step is the product of an enzymatic process with a certain 
net rate. This model contains an elongation state with 
forward steps of Ihp in size generated using an effective 
stepping rate keiong- We also account for backtracked 
states which can be entered by a backward step of Ihp 
[4Ql ITT] with a rate In a backtracked state Pol II 
can step forward or backward by Ihp with the rates kf 
or kb respectively. 

Secondly, we simulate experimental noise including ef¬ 
fects of confined Brownian motion of trapped micro¬ 
spheres. To accurately reflect the experiment, we take 
into account changes in tether length and tether stiff¬ 
ness due to the motion of the enzyme. We apply a har¬ 
monic description of the trapping potentials and assume 
that the DNA linker can be described by a spring con¬ 
stant kpNA determined by the worm like chain model 
(appendix). 

In real experiments, the equilibrium position of the 
trapped microspheres is influenced by drift which leads to 
colored noise characteristics on long timescales. Sources 
of drift are, for example, pointing or power fluctuations 
of the trapping laser or temperature drifts. To analyze 
the influence of drift on the detected step signal we sim¬ 
ulate drift as a confined Brownian motion with a very 
slow time constant (^ lOmin) and a diffusion constant 


of lOnt^/s. This represents a stochastically fluctuating 
base line which is added to the simulated steps. Eurther- 
more, the drift signal is assumed to be small enough not 
to affect kinetic parameters of the stepping simulation. 
Using these parameters, the simulation produces drifts of 
around Inm on a timescale of ^ Imin (figure which 
can be even outperformed by current high resolution in¬ 
struments 0111]. 

We simulated a slow, an intermediate and a fast sce¬ 
nario with elongation rates of keiong = 4.1i7^, keiong = 
9.1Hz and keiong = 25.1772:, respectively. Eor the slow 
scenario we generated N = 2.5-10^ data points with time 
increments corresponding to a bkHz sampling frequency. 
Simulated signals of the intermediate scenario consist of 
N = 10^ data points with 2kHz sampling frequency. The 
computed standard deviation in both scenarios is b.5hp 
at the given sampling frequency. Eor the fast scenario we 
chose N = b ' 10^ data points and IkHz sampling rate. 
Moreover in the fast scenario we use higher noise ampli¬ 
tudes with a computed standard deviation of lO.Ohp at 
the IkHz sampling frequency. 

Eor EBS analysis of the noisy steps we have to choose 
the parameters pp, ps and pp as well as the level spacing 
for CC. Since our task is to optimize Eq. 0 we are only 
interested in relative values of the data and interaction 
function. Thus, we can arbitrarily set pp = 100. ps and 
pP are parameters that have to be defined by the user. 
The smoothing parameter ps has to be large enough to 
cluster small steps but small enough not to miss simu¬ 
lated steps. To this end, simulated data can be used to 
optimize parameters such that as many steps as possi¬ 
ble are recovered but only few false positives are created 
(appendix). We choose = 2, pp = 50 and use a level 
grid spacing of Ihp i.e. the simulated step-size. 

In order to compare different step-finding algorithms, 
we need to define a criterion when a detected step oc¬ 
curs at the correct time-point. A detected step is clas¬ 
sified correct whenever its temporal position lies with 
-^/S.y^indow of the simulated step. We choose the win¬ 
dow size such that A^indow = 1/(5 • keiong)• This allows 
for a small temporal shift of the detected steps with re¬ 
spect to a simulated step. The window is small enough to 
minimize classification of a step as correct by chance but 
large enough to make the step detection robust against 
numerical error. 

The definition of correct steps is further used to in¬ 
troduce two quantities that characterize step detection 
performance. The recall is defined by the number of cor¬ 
rect steps divided by the number of simulated steps and 
provides information about the completeness of the re¬ 
covered steps. The recall’s value is only meaningful in 
combination with a second quantity called precision. Pre¬ 
cision is defined by the number of correct steps divided 
by the number of detected steps, which is essentially the 
probability that a detected step is in the above defined 
time interval around a simulated step. 
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Detecting backtracked regions 


Results Discussion 


In transcription elongation periods of forward motion 
are oftentimes interrupted by backward steps. This so- 
called backtracking is important in-vivo for regulating 
transcription and therefore it is desirable to accurately 
detect backtracks in order to better understand regula¬ 
tion. Dwell times between detected steps are assigned to 
the set of backtracked states when they lead to a back¬ 
ward step. A backtracked pausing interval ends at a for¬ 
ward step that transfers Pol II back to the elongation 
state. At high noise and for fast steps we do not ex¬ 
pect that our method will perfectly find all backtracking 
events present. For example, short backtracks can be 
omitted resulting in a long dwell time between two for¬ 
ward transitions in the detected steps. However, since 
the rates of backtracking are slow compared to elonga¬ 
tion rates, we can correct for the missed detection of a 
backward step by a statistical hypothesis testing of dwell 
times, assuming that forward stepping follows an expo¬ 
nential waiting time distribution (appendix). The cor¬ 
responding mean dwell time can be estimated from the 
dwell time histogram of forward steps. Thus, dwell times 
which violate this hypothesis are also considered as back¬ 
tracked intervals, even if the actual backtracking step is 
not detected. 

A typical method for this separation is a Savitzky- 
Golay filtered velocity threshold pause detection 
(SGVT). SGVT finds backtracked regions in Savitzky- 
Golay smoothed data from histograms of instantaneous 
velocities [42]. These histograms show a pause-peak 
around zero velocity and an elongation-peak. One typi¬ 
cally defines a velocity threshold by computing the mean 
plus one (or two) standard deviation(s) of the pause-peak 
which is used to characterize paused regions in transcrip¬ 
tion data. A sensible choice for typical Pol II experi¬ 
ments of the Savitzky-Golay filter parameter is to use 
third order polynomials and a frame size of 2.5s |4|. We 
will compare the performance of the SGVT algorithm to 
EBS in determining backtracks. 


Reliable implementation of EBS 


We developed the EBS algorithm to determine steps 
in the trajectories of molecular motors (the software 
package can be downloaded at https://github.com/ 
qubit-ulm/pwcs) and tested this algorithm on simulated 
data of Pol II stepping using published rates (methods 
and appendix). We first simulated data using the inter¬ 
mediate scenario (methods). We simulated a trajectory 
of 505 (i.e. 10^ data points) resulting in 291 steps. In our 
simulation noise amplitudes are much larger than the Ibp 
steps of the simulated step signal (figure [^. TVDN effi¬ 
ciently removes noise and produces a set of 587 plateaus 
(figure [4a[). The TVDN data approximates the simulated 
step signal, but often decomposes a simulated step into 
several smaller steps. 

How well a particular algorithm can detect steps is 
best tested by computing the recall, i.e. the number of 
correct steps divided by the number of simulated steps, 
and the precision, the number of correct steps divided 
by the number of detected steps (methods). Eor ideal 
step detection both recall and precision have to be close 
to one. A step finder which exhibts low recall but high 
precision tends to underfit the simulated step signal. On 
the other hand, high recall but low precision is a sign of 
overfitting. Both, underfitting as well as overfitting are 
undesired since they may significantly distort statistical 
properties calculated from the detected step signal. 

Eor the simulated data the computed recall of TVDN 
is 69%, which is fairly high. However, this comes at the 
cost of a low precision of 34%. In the second step of EBS 
we use GG to cluster the denoised data to predefined 
levels of integer multiples of the known step size of Ibp. 
This results in a total of 176 found steps and thus many 
steps in the TVDN data are removed (figure [il^. TVDN 
visually traces the simulated data very well (figure 4a), 
however overfits the signal, i.e. there are many more de¬ 
tected plateaus than simulated steps. In this example, 
the CG algorithm performs much better, due to the high 
noise not all steps are recovered (figure [41^ . Some simu¬ 
lated steps were missed or fused to steps of double size. 
Compared to the TVDN the computed recall is slightly 
reduced, but the precision of 51% is much higher, show¬ 
ing that the data is fitted more accurately. The quality of 
the performance of CC depends on the value of the prior 
potential parameters ps^pp tuned to optimize precision 
and recall (appendix). 


Stability and Scalability of A/i-Heuristics 

The A/i-heuristic is the starting point of finding steps 
which are corrupted by noise and here we analyze the ap¬ 
plicability of this scheme on simulated data. In general 
we do not expect that this scheme returns good results for 
arbitrarily large noise amplitudes or sampling frequen- 




FIG. 4: EBS algorithm correctly detects steps in 
presence of high noise. [4a| noise reduction after 
application of TVDN. |4b| step detection using 
combinatorial clustering. Shown is a zoomed in interval 
of the simulated noisy data (grey points), boxcar 
averaged noisy data (20 times reduced, green), 
simulated steps (blue), denoised signal from TVDN 
(magenta, and detected steps after application of 
combinatorial clustering by Graph Gut (red, 4b). 


cies on the order of stepping rates. The dependencies on 
noise amplitudes and sampling frequencies for Poisson 
distributed steps (forward stepping with rate constant 
10 i^ 2 ;) covered by noise can be best summarized in the 
following phase diagrams (figure|^and|^. As for the data 
shown in figure we compute the number of produced 
steps after TVDN for different denoising parameter A. 
For a signal of lOOs length with 1016 Poisson distributed 
steps we vary the sampling frequency and keep the stan¬ 
dard deviation of noise constant at 4.46p (figure [^. For 
each sampling frequency the number of produced steps is 
normalized to the number of simulated data points. Fur¬ 
thermore, we vary the standard deviation of noise and 
keep the sampling frequency constant at QkHz (figure 

§• 

In the overfitting regime (white), the number of steps 
of the denoised signal equals the number of data points. 
At X/Xmax = 1 the denoised signal is constant without 
any steps. At a sampling frequency / = 10kHz the 
number of steps as a function of A has a clear transi¬ 
tion between overfitting and underfitting and resembles 
the data shown in figure (blue). As the sampling fre¬ 
quency is lowered the transition is shifted more and more 
torwards Xmax- Below a sampling frequency of 100Hz 
the number of produced steps are gradually increasing 
until there are as many steps as data points, as was al¬ 
ready observed for the data in figure (red curve). If 
the sampling frequency is this low, the A/^-heuristic is 
not applicable anymore since TVDN breaks down and 



ln(A/ Xyuo^x) 


FIG. 5: Over fitting transition depends on sampling 
frequency. Each step signal has 1016 poisonian 
distributed steps sampled with different frequencies and 
covered by noise. The signals are lOOs long. The 
colorbar shows the value of ln(n/V), where N is the 
total number of data points and n the number of 
denoised steps. 
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FIG. 6: Influence of SNR on over fitting transition. 
Each step signal has 980 poissonian distributed steps 
with different noise amplitudes at 6kHz. Every signal is 
1005 long and consists of 6 • 10^ data points. The scale 
of the colorbar is chosen analog to figure 


just imitates noise. At lOOHz there are on average 10 
data points for each plateau. Since steps are poissonian 
distributed many steps have plateaus that consist of less 
than 10 data points and are thus hardly distinguishable 
from noise. 

For decreasing signal to noise ratio (SNR) we get a 
similar shift of the phase boundary torwards Xmax foi* 
worse SNR, figure 


The Influence of drift on EBS performance 

Actual measured data exhibits drift stemming from the 
instrument. To analyze the influence of drift on step de¬ 
tection performance of EBS, simulated data (slow sce¬ 
nario) is used with different amplitudes of a stochastic 
drift. An example of such a drift can be seen in figure 
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7a It produces a deviation of 16.6nt in a time interval 


of 40s compared to the simulation without drift (figure 


7a, green arrow). This slightly influences the A-versus- 


number-of-steps curve of the TVDN heuristic (figure [7b|). 
In an intermediate regime of \/\max the additional low 
frequency fluctuations produce slightly more steps when 
drift is present (figure [7b{ red) compared to the same 
noisy step signal without drift (figure [71^ blue). 

Although EBS can eliminate most of these drift in¬ 
duced TVDN steps some false positives remain which 
decreases the algorithms step detection precision. We 
analyzed the influence of drift on the precision by succes¬ 
sively increasing the diffusion constant D of the drift sim¬ 
ulation {D e {0.0,1.7, 5.9,10.0,34.0,117.0}ntV5). For 
every value of D, 25 signals were simulated according to 
the slow scenario and the mean precision of step detection 
was plotted against the mean peak-to-peak difference of 
the drift of each signal (figure [^. For relatively small 
drift (6nt) precision decreased by only 1% and thus the 
influence of such a drift is rather negligible. Only the 
comparably large drift of 27nt decreases the precision to 
44% and thus introduces much more false positives than 
for signals without drift. 


EBS outperforms existing algorithms 

In the following we compare the performance of the 
EBS algorithm to commonly used algorithms for detect¬ 
ing steps in the trajectory of motor proteins namely, a 
t-test [18], (using an implementation that sweeps over dif¬ 
ferent window sizes, m), the Kalafut and Visscher algo¬ 
rithm (K & V), [24], (using the implementation from [20]) 
and the variable stepsize hidden Markov model (HMM) 

m- 

In order to quantitatively compare the results of the 
algorithms, we chose the slow, intermediate and fast sce¬ 
narios (methods). To get statistically meaningful results 
we simulated 25 time traces for each scenario. Input pa¬ 
rameters of the step-detection algorithms were adjusted 
once for each simulation scenario (appendix). After the 
analysis the detected steps were compared to the sim¬ 
ulated input steps by computing recall and precision 
according to our criterion of correctly recovered steps 
(methods and figure]^. 

For the slow scenario, around half of the simulated 
steps could be recovered by each of the four algorithms 
(figure [^. While the K & V algorithm recovers the 
fewest of the simulated steps (recall: 42%), the much 
larger precision (87%) shows that there are comparably 
few false positives among the detected steps. The other 
algorithms exhibit a somewhat smaller precision (t-test 
61%, HMM 64% and EBS 68%), but a higher recall (t- 
test: 50%, HMM: 45%, and EBS: 51%). This means that 
more detected steps are misplaced or shifted with respect 
to the simulated steps. Hence, for these conditions all 
four algorithms work well and recover a similar amount of 
steps in a close vicinity of the simulated steps. However, 


80 



time/s 

(a) 




(c) 


FIG. 7: Detection of steps in noisy signals corrupted by 
drift. 7a) Shown is the noisy signal with drift (grey 
dots), the drift (black), simulated steps + drift (cyan) 
and the detected steps (red). The drift is modeled by 
an Ornstein Uhlenbeck process {1/k = 9min, 

D = 10nt‘^/s) and mimics low frequency instrumental 
fluctuations around an equilibrium position. Here, over 
the simulated time interval of 50s the signal has a drift 
of 16.6nt (peak-to-peak, indicated by green arrow 
dotted lines). [71^ TVDN heuristic of noisy signals with 
simulated drift is preserved. In the intermediate regime 
where Xh is located, the A-heuristic curve of the noisy 
signal with drift (red) is slightly above the curve of the 
same signal. With increasing drift steps are 
detected less precisely. Shown is the precision (blue) 
and recall fredl of nea.k to neak difference of the drift 
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EIG. 8: Performance of step detection algorithms with 
respect to slow scenario (blue bars), intermediate 
scenario (green bars) and fast scenario (yellow bars). 
(8a) shows in percent of the total number of simulated 


steps the number of detected steps. ( |8b[ ) percentage of 
correct steps among the simulated steps. Error bars are 
SEM. (8c) shows average step size histograms with Ibp 
binning of the detected steps of the fast scenario for 
t-test, HMM, K&V and EBS (from upper to lower 
histogram). 




time/s 


EIG. 9: Dwell time distribution of detected steps in the 
fast scenario. Displayed are dwell time histograms (blue 
bars) of the t-test, HMM, K&V and EBS algorithm 
(from upper to lower panel). Eor better comparison, the 
histogram of simulated dwell times is depicted in the 
lowest panel. Each dwell time histogram is fitted by a 
double exponential decay (red line) which yields the 
following rates ih^ause!belong Hz): t-test (0.017/8.9), 

HMM (0.088/3.9), K&V (0.01/2.8), EBS (1.3/13), 
simulation (2.8/22). Note that in case of the detected 
dwell times the first two bars are not taken into account 
since steps with short dwell times are likely to be 
skipped by step detection algorithms. 


the K&V algorithm is a little more conservative towards 
placement of new steps thus increasing the precision but 
lowering the recall. 

In the intermediate scenario stepping rates are faster 
which clearly reduces the recall for the t-test (15%). This 
effect is less dramatic for the HMM (30%), K&V (26%) 
and EBS (30%). The precision of HMM (67%) and K 
& V (65%) are at a similar level followed by EBS (57%) 
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FIG. 10: Shows the Kullback-Leibler divergence of 
dwell time histograms of the detected dwell time 
distribution with respect to the simulated one. 


and t-test (50%). 

The fast scenario exhibits even faster steps and higher 
noise amplitudes and thus is the most difficult simulation 
setting considered here. The t-test recovers 12% of the 
simulated steps at around half the precision of the other 
algorithms showing the worst performance. The per¬ 
formance also decreased for the other three algorithms. 
However, compared to HMM (10%) and K & V (8%), 
EBS recovers approximately twice as many correct steps 
(19%) at a comparable precision (HMM: 37%, K & V: 
50% and EBS: 42%). 

The correct timing of a detected step, as described by 
the computed values of precision and recall is only one 
important aspect of step detection. It is also important to 
test whether the recovered step-size distribution resem¬ 
bles the simulated stepping behaviour (figure . Eor 
the fast scenario, due to the lower bandwidth and faster 
stepping rates the algorithms do not reproduce the simu¬ 
lated step size well. Here, all algorithms tend to fuse Ibp 
steps to steps of larger size which explains the smaller 
number of found steps compared to number of simulated 
steps (figure 8a, 8^ . While the t-test is showing a broad 
distribution of step sizes and both HMM and K & V 
detect mostly steps of size larger than 26p, the step size 
distribution obtained by EBS resembles the expected dis¬ 
tribution most closely. In contrast for the slow scenario 
step-size histograms show a majority of the expected Ibp 
steps for all algorithms considered here (appendix, figure 


18). Therefore, in comparison with the fast scenario it 


becomes evident how much the noise influences the step 
size distributions. Compared to the other algorithms, the 
denoising stage of EBS is the most robust. 

Important statistical properties of the underlying 
chemical cycle of a motor protein are often obtained from 
the distribution of dwell times, i.e. the duration between 
adjacent steps. To analyze the quality of the detected 
dwell times in the fast scenario, we compute dwell time 
histograms {25ms binning) from the detected steps of 
each algorithm and compare them to the distribution of 
simulated dwell times (figure [^. While the EBS derived 


dwell time histogram has a similar shape than the actual 
simulation input, the other algorithms fail to recover the 
general shape of the histogram. This observation is also 
reflected in the rate constants of a double exponential fit 
to the dwell time distribution (figurered curve). Here, 
rate constants extracted from the steps detected by EBS 
deviate about a factor of two from those extracted di¬ 
rectly from the simulated distribution. In contrast, the 
rate constants determined by the other algorithms devi¬ 
ate by several orders of magnitude when determining the 
pausing rate and are also considerably worse compared 
to EBS in determining the elongation rate. 

How well the detected dwell time distributions re¬ 
produce the simulation can also be quantified by the 
Kullback-Leibler divergence (figure [1Q|). As expected the 
Kullback-Leibler divergence of EBS is smaller compared 
to the other algorithms. Moreover, due to the slower 
stepping rates and smaller noise amplitudes in the slow 
scenario, dwell time histograms of detected steps are 
more similar to the distribution of the simulation than 
in the fast scenario (figure 10, blue bars). 

In summary, with properly adjusted parameters, none 
of the algorithms overflts the highly noisy data since pre¬ 
cision exceeds recall in all four cases and thus there are 
fewer detected steps than simulated steps (figure 8a and 


8b). Nevertheless, the low recall performance means that 


step detection accuracy is strongly compromised for the 
lower bandwidth signal of the fast scenario and directly 
extracting information of the underlying enzymatic cy¬ 
cles of elongation from dwell time fluctuations would re¬ 
sult in errors. 


EBS is orders of magnitude faster than competing 
algorithms 

Moreover, we also compared the run-times for all four 
algorithms on signals which contain 2.5 • 10^ data points 
and ^ 600 simulated steps. We chose rate constants ac¬ 
cording to the intermediate scenario and recorded the 
respective run times {trun)- EBS is the fastest algorithm 
with run times of ^ 5s . The t-test is 150 times , the 
K & V: 500 times and the HMM: 1000 times slower (ap¬ 
pendix, table [^). EBS is fast enough, that even very 
high bandwidth signals with 10^ data points (^ 900 sim¬ 
ulated steps) can be compressed very quickly, yielding a 
run time of only ^ 3mm (appendix). Therefore, EBS 
can process much more data points at comparably short 
run time and is essentially limited only by the available 
memory size (appendix). The ability to quickly process 
a large number of data points can be used to increase 
the accuracy of step-finding when the signal is sampled 
with higher rates. Eor example when using kinetics of 
the intermediate scenario, the recall can be increased at 
similar precision from 30% with 2kHz sampling rate to 
40% with 2^^kHz (appendix). 

In summary, in the slow and intermediate scenario the 
algorithms under consideration perform similarly in the 
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total number of steps found as well as in the number of 
correct steps. In the fast scenario where elongation rates 
are faster, bandwidth is lower and noise amplitudes are 
higher EBS shows better results. Moreover when using 
EBS, the results of the fast scenario could be improved 
by higher sampling rates which gives more data points for 
each plateau while still preserving comparably short run 
times. Thus, the EBS method especially excels for signals 
obtained from long measurement time, high bandwidth 
and poor signal to noise ratio. 


EBS detects sub-steps in experimental data of (/p 29 
DNA packaging 


Experimental data of Pol II transcription at saturating 
nucleotide concentrations yield rates comparable to the 
fast scenario. Eor these conditions the EBS as the best 
performing algorithm would be able to correctly detect 
only ^ 19% of all simulated steps. Therefore, in order to 
better test the step-finding properties of EBS on actual 
experimental data one would need to reduce the stepping 
rates, or apply the algorithm to a motor protein with 
larger step size. A prominent example for such a process 
is the packaging of DNA by the bacteriophage ip29 motor, 
which makes steps of lObp which consist of a burst of four 
steps with a size of 2.bbp each [12]. 

We have applied EBS to experimental stepping data of 
ip29 recorded with a bandwidth of 2.bkHz using opposing 
forces of around 5pN [la US- We used 2.bbp for the 
level grid spacing as well as for the jump height prior (e 
in Eq.0). The standard deviation of the experimental 
noise at this sampling frequency was found to be ~ S.Sbp. 
Eor this motor at low forces of a few pN a fast burst of 
four 2.bbp steps is followed by a long dwell time (figure 
l^a). The presence of 2.5bp steps had previously been 
identified at large forces leading to a slow down of the 
2.bbp steps [12]. At the forces of bpN the previously 
applied t-test had failed to resolve the 2.bbp steps. In 
contrast, some of the steps are detected by EBS (figure 
11 and appendix figure . 


EBS detects pausing of Pol II 

While the EBS algorithm is not able to determine a 
large fraction of steps of Pol II at saturating nucleotide 
concentrations given published noise levels, it can be used 
to investigate pausing of the enzyme. 

We use EBS to detect pauses (methods) in experimen¬ 
tal data from single molecule transcription elongation 
data of Pol II (M. Jahnel, S. Grill Lab). Eurther we 
compare the pauses extracted from EBS step data to the 
result of SGVT (methods) which is a commonly applied 
method from the literature [4]. The signal consists of 
A ^ 7 • 10^ data points and was recorded with a sam¬ 
pling frequency of IkHz. The noise amplitude has an 
estimated average standard deviation of ^ 10 bp. Thus 


the experimental data is comparable to the fast scenario. 
We used both SGVT as well as EBS to detect pauses and 
backtracks of the enzyme (methods, figure 11). When 
comparing the results from both algorithms one finds 
that most long pauses do overlap, while differences are 
observed for the detected short pauses. 

In order to get a better understanding of how well the 
two algorithms perform, we again use simulated data 
with parameters for stepping rates and sampling fre¬ 
quency according to the fast scenario (appendix). In ac¬ 
cordance with previously published discussions on back¬ 
tracked pauses [43] we distinguish long {t > tp) and short 
pauses (t < tp) by a time scale tp = \/= 0.8s. 
All simulated long pauses were found by EBS (100%) and 
the total length of long pauses compared to simulated 
long pauses was 113%. Also the SGVT found almost all 
long pauses (98%) with 94% of the total duration of sim¬ 
ulated long pauses. Both methods did not falsely assign 
long pauses and thus the result of finding long pauses in 
step detected data and in SG filtered data largely agrees. 

However concerning short pauses, EBS outperforms 
SGVT in recall ( EBS: 61%, SGVT: 38%) and precision 
(EBS: 92%, SGVT: 57%). 

Especially for experiments with near base pair reso¬ 
lution and slow elongation rates (i.e. k^iong ^ kf^kiy), 
SG filtered data is not suitable to distinguish between 
pauses and natural waiting times of elongation and hence 
step-detection becomes the only option. Eor these exper¬ 
iments pause-detection accuracy is very high and allows 
the analysis of dwell time fluctuations. This provides fur¬ 
ther insights into enzymatic reaction cycles such as DNA 
sequence dependent dynamics [44]. 


Conclusion Outlook 

We have presented a novel energy based step finding 
scheme comprised of a denoising stage that uses TVDN 
followed by a GG analysis. The GG stage uses a Graph 
Gut algorithm and provides the possibility to include 
prior information. Eor biomotors with unknown step size, 
GG can be performed without step size prior terms. If 
the detected steps exhibit a dominant step size, a sec¬ 
ond application of EBS with this prior information can 
improve results. The EBS algorithm outperforms current 
schemes for detecting steps or pausing events in time tra¬ 
jectories of molecular motors. In case of high-noise data 
it had the highest recall with comparable precision. The 
higher step detection performance of EBS is also reflected 
in the step size and dwell time distributions which better 
reproduce the simulated distributions. In particular, for 
the fast scenario where the recall is rather low, further 
analysis of the dwell time distribution returns useful rate 
constants in contrast to the rates extraced from dwell 
time distributions of the competing algorithms. In addi¬ 
tion EBS is much faster than competing algorithms. 

In particular the high computational speed of EBS be- 
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FIG. 11: (a) 2.bbp substeps in ip29 bacteriophage data (circles) measured in an optical tweezers experiment, EBS 
(blue) and t-test (red), (b) Paused regions in experimental Pol II transcription elongation data. Shaded regions 
indicate paused intervals found by the SG-filtering method with a velocity threshold of two standard deviations of 
the pause peak (green) and EBS (red). IkHz sampled transcription data (grey), SG filtered data of polynomial 
order 3 and frame size: 2.5s (cyan) and step detection result of our method (blue), (c) Step size histogram of 
detected steps by EBS in the Pol II data shown in (b). (d) Zoom into a detected pause. Shown is EBS signal (blue), 
SG filtered signal (black), measured data at IkHz (black circles) and paused regions (SG: green, EBS: red). 


comes an advantage when multiple executions of the al¬ 
gorithm are necessary. One example for an extension of 
EBS with multiple executions, is an iteratively adapting 
level grid which could be used for signals with unknown 
step heights. Similar schemes are already available for 
HMMs [H] and were successfully applied to ERET data 
[45] . Eor EBS, this could be implemented by methods 
from Multi-Model Eitting [46]. Another example could 
be to expand EBS to allow for drift correlation. As is, 
EBS is relatively insensitive to drift so that drifts on the 
order of \9hpjmin have a negligible effect on step-finding 
(results & discussion) [3]. However, one could explic¬ 
itly correct for drift by using a decorrelation scheme as 
previously developed m- Again this would necessitate 
multiple execution of EBS. 

A reason the proposed EBS method exhibits compet¬ 
itive performance is a favorable representation of infor¬ 
mation in the signal, which led to the two stage process. 
Here, we have used TVDN to build a fast and unbiased 
denoising scheme while still preserving the step features 
of the underlying signal. This was possible by using a 
drastically improved algorithm for solving the one dimen¬ 
sional TVDN problem which allows us to choose the reg¬ 
ularization parameter automatically. In fact, TVDN 
with this Xh performs often very well in tracing the ac¬ 
tual signal even under noisy conditions. Gonsequently, 
if TVDN is used as first stage, the choice of the regular¬ 
ization parameter is very important and can significantly 
influence the performance of further steps. 


Previously, TVDN has been applied in a step detection 
algorithm of the rotary flagella motor movement [20j |27] . 
The method to determine the parameter A developed here 
could be directly applied to this problem thus increas¬ 
ing the accuracy of the denoising scheme. Nonetheless, 
a more rigorous theoretical examination of the sudden 
change from over- to underfitting of TVDN which led to 
our heuristics remains to be done. Donoho et al. [48| 
have reviewed the observation that sudden break-downs 
of model selection or robust data fitting occur in high¬ 
dimensional data analysis and signal processing. They 
further refined this finding for Gompressed Sensing in 
m, which is a class of li regularized convex optimiza¬ 
tion problems. It remains an interesting question if simi¬ 
lar theoretical statements can be established for TVDN. 

There exist different ways to solve the subsequent clus¬ 
tering problem for step detection. Eor example when 
step sizes are uniform and the signal is periodic, such 
as for the above mentioned rotary bacterial flagella mo¬ 
tor, a Eourier transform-based technique with nonlinear 
thresholding in frequency space can be used m- 

In contrast the presented GC algorithm is broadly ap¬ 
plicable to non-periodic signals. We found that our im¬ 
plementation of GG is very well suited to cluster the out¬ 
put of the compression since it provides a framework to 
include prior information and it applies to a broad class 
of step signals including steps with non uniform sizes. 
Eurther there are comparably fast algorithms available 
to solve relevant energy functions. In fact, we found that 
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our algorithm scaled approximately quadratically in the 
number of tuple and linearly in the size of the predefined 
level set in our applications (appendix). The penalizing 
energy scheme can be extended in an intuitive way to 
other prior information. For example, a histogram prior 
could yield a global energy term that favors certain step 
sizes and dwell time histograms. 

The adjustment of the regularization parameters of the 
pi energy function can be guided by comparing results 
with simulated stepping data. This choice is not depen¬ 
dent on noise due to the preceeding application of TVDN. 
Further by using weights in the energy terms the regu¬ 
larization parameters can be applied to different datasets 
of the same underlying stepping process. 

Both, the TVDN stage as well as the clustering stage, 
provide the possibility to harness parallelization to gain 
speedups. A long high bandwidth trajectory could be 
divided into smaller time-intervals, which could then be 
treated in parallel. Of course one would need to find a 
way to take care of the boundaries between the intervals, 
e.g. by shifting the time intervals and merging the data. 
This extension would also make a quasi online processing 
of measurement data possible, where new intervals are 
successively ingested. 

EBS was successfully applied to detect pauses by Pol 
II as well as 2.5bp steps in the packaging of DNA by the 
bacteriophage (p29 motor. However, while some steps 
could be found, the larger the noise and smaller the step- 
size the fewer correct steps are found. To make fully 
use of the advantages of EBS higher bandwidth data is 
needed. Moreover, shorter tether length, smaller beads 
or stiffer handles provided by DNA origami m increase 
resolution and thus improve step detection. 

In summary, the EBS method fills the gap of tools 
which are able to handle high bandwidth data with many 
data points as well as very noisy data under quite general 
assumptions. Regardless of the difference in TVDN and 
Graph Cut the energy based model provides an intuitive 
access for the user of the method. 
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APPENDIX 

Determination of Xmax in TVDN 

In this section we want to show, how to determine the 
value of Xmax in TVDN analytically. The Xmax value 
determines the value of the regularization parameter A 
in equation © above which the solution x* remains con¬ 
stant and therefore contains no steps anymore. The infor¬ 
mation in the following subsections is twofold: First de¬ 
rive general expressions for the Fenchel-Rockafellar-Dual 
problem and the forward-backward splitting applied to 
TVDN, second we then derive a condition for Xmax from 
the Fenchel-Rockafellar dual problem and provide an an¬ 
alytical solution. Furthermore we give hints on the spe¬ 
cial (tridiagonal) structure of the involved operators. The 
definitions in the next sections follow the work of [52] . 


Fenchel-Rockafellar Dual Problem and Forward-Backward 
splitting 

Independent of the problem of our work, we start with 
a function f{x) which is convex, proper, and lower semi- 
continuous. Then 

G = maximize(x, rt) — f(x) (8) 

is called it’s Legendre-Fenchel dual function [52|. /* is 
also convex, and it holds (/*)* = /• A further special¬ 
ization is useful in the context of our work, as the TVDN 
problem consists of a minimization of two composed con¬ 
vex functions 


minimize/(x) + g{A{x)) (9) 

where A G and the convex functions / : ^ M 

and g :W We assume, that /* G and therefore 

there exists a Lipschitz continuous gradient. 

Due to the Fenchel-Rockafellar theorem, covered in 
Chapter 15 of [52], the following problems are equiva¬ 
lent: 

minimize f{x)Fg{A{x)) = — minimize f\—A^u)Fg\u) 

( 10 ) 

where f denotes the adjoint function. The unique so¬ 
lution of the primal problem x'^ can be recovered from 
a solution of the dual problem which has not to be 
necessarily unique. 

X* = Vfi-A^u*). (11) 

We use an additional assumption which is not a con¬ 
straint for the TVDN problem: g is simple. That means, 
one can compute a closed-form expression for the so- 
called proximal mapping 

I 2 

prox (x) = argmin- ||x - 2 :|| -h ^g{z) Vy > 0. (12) 


Further due to Moreau’s identity g^ is also simple [52]. 

Now having the connection between primal and dual 
problem at hand, this means, one has to solve again a 
composite problem of a convex and a simple function 

minimizeF(i4) + G(u) (13) 

with F{u) = f^{—A^u) and G{u) = g^{u). 

A typical method to do Proximal Minimization is 
Forward-Backward splitting (see eg. Chapter 27 of [52]). 
The dual update is given by 

^(^+1 ) _ — 7VF(i4^^^)^ . (14) 

In this update step 7 < I//2, where L is the Lipschitz 
constant. The primal iterates are given by: 

xiD . (15) 

The above general statements and theorems are taken 
from the tool set of Convex Analysis. For further back¬ 
ground see e.g. [32] or [52]. In the following we discuss 
more problem specific expressions. 


Application to Total Variation Denoising 


In a continuous picture the total variation of a smooth 
function 0 : M ^ M is defined as 

j(<^)=y'iiv<^(s)iid5. (16) 

In the discretized version one has to consider a discretized 
gradient operator A : ^ with p = n — 1. 

J{x) = \\Ax\\ = F,Ui (17) 


where Ui = and therefore A taking the following 

form: 


/I -I 0 
0 I -I 


VO ••• 


0 \ 


-1 

1/ 


(18) 


Using this and taking into account that the Divergence 
and Gradient operator are minus adjoint of each other 
((V/, g) = — (/, V-^)) the adjoint of the discrete gradient 
operator is minus the discrete divergence: 


At 


/-2 I 0 
I -2 I 

0 I '• 




: I 

Vo ... 1 -2/ 


(19) 






16 


Therefore the divergence highly resembles a typical 
laplace filter from signal processing. This leads for a 
single entry to Ui — Ui-i = — 2xi + For the de¬ 

viation of \max we assume the boundary conditions that 
1^0 = 0 and Un = 0. 

For noise removal (and to get the connection to eq. 
(©) the following problem has to be solved 

1 9 

x'^ = argmin- \\x — y\f + AJ(x), (20) 

xeR'^ 2 

To make use of the material so far choose the following 
composition 

fix) = ^\\x-yf and ^(u) = A ||m|| . (21) 


7 < 


PU|| 


(29) 


Inserting the above statements into the general dual up¬ 
date step from eq. (14), one gets the following expression: 


«('+!) = prox^G («(') - 7 VF(m('))) 


y(0 _ 7 VF(u(*) 

maxfl, 


(30) 


i(0 — y) 


max 1 


||u(0 —^A{A^ —y) 

A 


After that one has to translate f{x) and g{x) into their 
dual representations (u) and {u) by using the follow¬ 
ing relations Derivation of Xmax from the Proximal Iteration 


• For f{x) = l/2\\Ax — y\\ and A G can be 

inverted then 

fHu) = l\\iA^)-^u + y\f ( 22 ) 

• For f{x) = \\x\\p = ^ - {\xi\'^)p is a p-norm: Then 
the dual function corresponds with the indicator 
function lq of the convex set C: 

y{u) = (,||.||<i where - + - = 1 (23) 

" q p 


Finding a maximal regularization parameter A is equal 
to finding a a criterion, such that the dual iterations re¬ 
main constant V/ 

uiW) = , (31) 

By using eq. one can see, that this will lead to 

a steady state solution xf = const Vi. For simplicity 
assume A = A/y. Starting from the proximal iteration we 
find that in case of A < the problem 

simplifies to 


Using that we get the following dual representation of 
the dual functions F{u) + G{u) for the TVDN problem 
in the Fenchel-Moreau-Rockafellar formulation. 

F(u) = - \\y — A^u\\^ — - \\y\f and , , 

V ; 2 11^ II 2 (24) 

G{u) = ic{u) where G = {u : < A} . 

The solution to the dual problem can be obtained by 
solving 

G argmin \\y-A^u\\, (25) 

||«||<A 


and by applying eq (11) the solution to the primal prob¬ 
lem x'^ 


x* = y-A^u*. (26) 

What is missing for concrete expression for the forward 
backward iterations is first a closed form for the gradient 
of F, which is given by 

VF{u) = A{A^u-y). (27) 


Secondly it is possible for the proximal operator of G, 
which is the orthogonal projection on the set G 


prox^G'^ = 


u 

max(l, ||m|| /A)’ 


(28) 


y(i+i) = y(0 - Ay + AA\^^^. (32) 


To satisfy the constant condition from eq. 
has to be in the solution of: 


(31) the 


AA^u = Ay . 
The shape of AA"^ is the following 


AA^ 


/-3 3 -1 ... 0 \ 

1 -3 3 : 

0 1 -1 

: ■••-3 3 

Vo ... 1 -2j 


(33) 


(34) 


Linear equations with a tridiagonal affine transform AA^ 
can be efficiently solved for example an algorithm pro¬ 
posed by Rose [53] . 

Still missing is a treatment of the primal iteration step 
^(z+i) = y — The connection to the A in the 

original TVDN problem is given such that, the Karush- 
Kuhn-Tucker conditions are still valid for our steady state 
solution ( [^ . This means, that every in the dual 
solution has to satisfy 


^^g[-A,A]. (35) 
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To ensure this, we have to choose 

Amax = 11'^ I loo 


(36) 


B 

•- 

s 


A 


#- 

i 


t 


which gives as a clear statement how to choose a maximal 
lambda. 


Algorithm Implementing the A/^-Heuristics 

As outlined in the Methods section of the paper, we use 
a sudden increase of resulting steps when decreasing the 
regularization parameter A in the TVDN problem shown 
in eq. (© from Xmax to determine A/^. In the following, 
we want to describe the heuristic method, we used to 
choose the value of A/^,. Starting point for the algorithm 
is the value of Xmax on a curve like the one depicted 
in figure The iterative method shown in algorithm 
[2 approximates the point of steepest ascent in an X-n 
diagram, where n is the number of steps, by searching an 
interval where the slope exceeds the slope of the secant 
of {O^Xmax}- The function N{X) counts the number of 
steps after the TVDN minimization for a given value of 
A. 

This simple method gave us stable results for a variety 
of our test signals, either simulated or experimentally 
gathered. In the following section we have a closer look 
into the stability of the effect of sudden increase of steps. 


Algorithm 1 Outline of our line-search algorithm to de¬ 
termine Xh 


A, n 
A+, 


'’^max ? N{X 

1+ j— X 


X 


max) 

- N{\ max 

\N{0)-N{Xmax)\ 


/ 2 ) 




while less than max. iterations do 

y |n+-n| 

0^ X+-X 

if S ^ ^start then 
break 
end if 


9: 

A, n ^ 

A+, n 

10: 

A+, n+ 

^ Al 

11: 

end while 

P 

12: 

return Xh 

^ A+ 


Mapping of Energies on Edge-Capacities 


FIG. 12: Situation in an Markov Random Field 
concerning a single variable Vi and edges A, B to 
special variables s and t relevant for the data term, 


s A 

•- 


C 


t 


E 


#- 

s 


-4-• 

^ i+l t 


FIG. 13: Two neighbouring variables in a Markov 
Random Field and edges relevant for the pairwise term. 
Here the two s vertices represent the same vertex in the 
graph and are just drawn seperated to make the 
diagram look nicer. The same is true for the t vertices. 


the Markov Random Field. In this section this mapping 
is explained stemming theoretical foundations outlined 
by Kolmogorov et. al. in [64]. The Graph Gut algorithm 
solves this problem in polynomial time for a certain set 
of useful energy functions [34l (63] |64| . 

In figure [T^ the situation for the data term is depicted. 
Here the mapping is easy, as the energy for a single vari¬ 
able Vi for the current level Eq is mapped to the Edge A. 
The energy Ei for a new level is mapped to the edge B. 

The situation for the pairwise term Vij is more com¬ 
plicated and depicted in figure Here two variables Vi 
and Vi-^i are involved which leads to four different energy 
combinations Eo,o, ^ 0 , 1 , ^ 1 , 0 , are possible. Here 
Eq^q is associated with the energy value if both variables 
get assigned a new level. In contrast Ei^i represents the 
energy of both variables keeping their current levels. The 
two other combinations represent the case when one vari¬ 
able keeps the current label and the other gets the new 
level assigned. Eq^i the variable i E I keeps its level, for 
Ei^q this is the case for the variable i. 

The four energies can be represented in the following 
way 


In the process of assigning a level to vertex Vj the 
above mentioned Graph Gut algorithm solves a binary 
decision problem, whether the assignment of a new level 
is more favorable in terms of the energy loss function or 
not. The binary outcome of the decision is reflected in 
the graph structure by introducing two special vertices, 
where t is associated with keeping the old and s with 
assigning the proposed level. The energy values of the 
data term Qi as well as the pairwise term and their 

different combinations of keeping the current level or as¬ 
signing a new level are mapped to capacities of edges in 


^0,0 

^0,1 


a b 


a 

a 

_i_ 

0 b — a 

^1,0 

^1,1 


c d 


d 

d 


c — d 0 


The first summand on the right hand side is mapped 
to terminal capacities. This means that the capacity a is 
associated with the edge C, and the capacity d with the 
edge B. The second summand maps to the edge E and 
gets the capacity b — a E c — d. 

At this point the above mentioned strategy to circum¬ 
vent a violation of submodularity is applied if Eq^q E 
Ei^i > Eq^i E Ei^q. Then in turn Eq^i and Ei^q is in¬ 
creased and £’o,o is decreased by a small amount until 
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the submodularity condition Eq. @ is satisfied. Details 
and limitation of this approach can be found in [39] . 


o-Expansion Algorithm Outline 

Finding a solution that minimizing eq. (§ is a prob¬ 
lem that is in general NP-hard to solve for \C\ > 3. 
The iterative alpha-expansion algorithm outlined in al¬ 
gorithm finds provably good approximate solutions to 
this problem. 


Algorithm 2 o-Expansion outline 

1: ^ arbitrary labeling of sites 

2: while not converged do 

3: for all a E C do 

4: ^ argminE(^, 

5: if E{C) < then 

6: e' ^ r 

7: end if 

8 : end for 

9: end while 


case only a limited amount of terms are non-sub modular. 
In principle there exist more sophisticated Graph Cut al¬ 
gorithms that alter the mapping of the combinatorial val¬ 
ues of the energy to capacities of the edges of the graph 
m- In the same work, the authors compare performance 
of their more complicated optimization scheme for non¬ 
sub modular energies to truncating the energy as we did. 
For a small percentage of terms violating the submodu¬ 
larity condition no severe degradation of the performance 
was found so we stayed with the simpler method, as it is 
more accessible and easier to reason about. The impli¬ 
cations of non-submodular terms highly depend on the 
underlying dataset and the chosen pairwise energy func¬ 
tion. If, like in case of our label prior term 0 the non- 
submodular case is a rare event, the simple truncation 
procedure has a positive impact. Thus, the submodu¬ 
larity violation is a problem that has rather theoretical 
implications than practical importance for our applica¬ 
tions. 


Scaling of Graph Cut 


In each iteration the algorithm updates or moves the 
current labeling if it has found a better configuration. 
To achieve this, in each iteration, a new, randomly chosen 
label (a G £ is introduced and each site Vi has the choice 
to stay with the previous label or adopt the new proposed 
label a. The binary optimization problem is solved via 
a Graph Gut (lineof algorithm]^. This step is called 
(a-expansion due to the fact, that the number of nodes 
with the label a assigned could grow during this phase. 
The outer iteration stops if no new label assignments 
happened within two cycles. The (a-expansion algorithm 
was initially published by Boykov et al. in [28] . 


Relaxation of Submodularity Condition by 
Truncating Energy 

The submodularity condition Q imposes structure on 
the energy minimization problem which allows stronger 
algorithmic results. In this sense the concept of sub¬ 
modularity plays a similar role for discrete, combinatorial 
clustering as convexity plays for continuous optimization. 

The Max-Flow/Min-Cut algorithm we use for mini¬ 
mization relies on that the supplied energy function sat¬ 
isfying the sub modularity condition. Unfortunately, the 
pairwise term 0 . does not strictly satisfy the submodu¬ 
larity condition Q. Therefore we adopted a truncation 
scheme proposed by Rother et al. in [39]. The truncation 
procedure for a single term can be summarized as follows: 
Either Vij{P^^) decreased or Vij (/d,(a) or Vij (a, 7 ) are 
increased until the submodularity condition is satisified. 
This procedure is applicable to any energy function, and 
provides a provably good approximation for a single ex¬ 
pansion move. The authors of [39] limit suitability for the 


When analyzing high-bandwidth noisy time traces of 
the movement of molecular motors the CC step often 
limits run time performance. Most of all, perfomance 
is influenced by system size, i.e. the number of tuples 
and number of levels in the label grid set. To analyse 
the scaling behaviour for these two influences numeri¬ 
cally, we simulated 10 noisy Poisson step signals for each 
system size and label grid set and record computation 
times, fig. (14) sh ows mean and standard deviations as 
error bars. In fig.(I4a) system size was increased from 
250 to 3000 tuples and the number of levels offered to 
the combinatorial optimization problem was kept con¬ 
stant to around 800 levels. In this case computational 
time is expected to scale mostly with the complexity of 
the Boykov-Kolmogorov max flow algorithm which has a 
worst case complexity of 0{\edges\ • \nodes\‘^ • C) [34] . 
Where C is the cost of the minimal cut, \edges\ and 
\nodes\ are respectively the number of edges and nodes 
in the graph. For the type of graphs considered here, for 
each additional tuple in the input data set we have to 
add two edges which would give a worst case complexity 
of roughly 0{N^ • C). However, computation times fit 
well to a quadratic function meaning that for our signals 
the scaling behaviour is better than the worst case com¬ 
plexity (figure [l4a] red curve). 

The second case is shown in fig.(I4b). When the system 
size is fixed (here: 750 tuples ) and the number of labels 
increases (here: from 10 ^ to ca. 10 ^) by refining the label 
grid subsequently, the corresponding run times increase 
linear. This is in agreement with the theory behind multi 
label graph cut problems [34]. The ^-expansion offers 
new labels one by one in a random order until all labels 
were used and the iteration stops. Thus the observed lin¬ 
ear scaling in the number of labels is also expected from 
theory. 
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xlO^ 



num. of nodes 



num. labels/10 

(b) 

FIG. 14: Mean run time performance of combinatorial 
clustering stage for 10 simulated noisy step signals. (Mi 
Graph Gut computation time versus number of tuples 
for a label grid of 800 levels. Second order polynomial 
fit to computation times (red curve). |14b| Graph Gut 
computation time versus label grid size for a fixed 
system size of 750 tuples. Linear scaling of performance 
with increasing number of levels in the label grid set. 
The error bars are SEM. 


It is important to point out that due to the TVDN 
compression the expression above is an improvement for 
this type of step signals (high bandwidth, number of data 
points ^ 10^ but comparably few steps < 1000 ) com¬ 
pared to the Fourier transform accelerated HMM imple¬ 
mentation [22]: 0 {m'n‘^N'log 2 m) where m is the number 
of position states, n the number of molecular states and 
N the number of data points. Moreover, the direct com¬ 
parison of run times and memory consumption given in 
the main text shows that our algorithm is advantageous 
regarding computational resources compared to existing 
algorithms. 


FIG. 15: Graph Gut MGMG comparison. Energies of 
optimal solutions with increasing system size for 
MGMG and Graph Gut algorihms. 


Comparison of Graph Cut and Markov Chain Monte 

Carlo 


Since Markov Chain Monte Carlo (MGMG) methods 
are standard techniques to optimize an energy functional 
with Pott’s model terms like Eq.Q, we compare the 
Graph Cut method with a Metropolis Hastings (MH) 
sampling and simulated annealing (SA) optimization al¬ 
gorithm m- In each iteration we randomly generate a 
proposal assignment of labels. The new assignment of 
a site is accepted or rejected according to the standard 
MH rules. Moreover a logarithmic temperature sched¬ 
ule is used for SA. The temperature parameter is intro¬ 
duced as commonly done: p(x) oc exp{—E{x)/T). If an 
accepted proposal has smaller energy than all previous 
ones it becomes the new configuration that minimizes 
Eq. To compare the quality of the step detection re¬ 
sult we computed the energy, Eq. © with prior terms 
Eq. 0 for the energy minimizing solutions of Graph Cut 
and MGMG method. Fig. (15) . For a system size below 


350 tuples, computation times of the Graph Cut algo¬ 
rithm were always below lOs. Since MGMG is compu¬ 
tationally more complex longer computation times were 
used for MGMG, i.e. 45mm which allowed for 12 iter¬ 
ations of a SA temperature cycle. Each cycle consists 
of 20 subsequent cooling steps and in each step we iter¬ 
ate through 20000 proposals. Inspite of the significantly 
higher computational cost the MGMG solutions always 
have higher energies compared to Graph Cut and the ex¬ 
cess energy increases for larger systems. This shows that 
MGMG returns increasingly worse solutions compared to 
the Graph Cut technique when the number of input data 
grows for fixed computation time. As expected. Graph 
Cut shows an approximately linear increase in energy 
with linearly increasing system size. 

To conclude, the plain MGMG algorithm used here is 




































































20 


conceptually simpler than Graph Cut but computation¬ 
ally more expansive and also less suited to cluster the 
denoised steps optimally according to an energy func¬ 
tional. This finding in one dimension is not surprising, 
since similar observations had been made in 2D image 
analysis [28] . 


Noisy step simulations 


Single base pair steps are typically exceeded by noise 
fluctuations and most of the time it is not possible to 
judge by eye whether an algorithm correctly positioned 
steps. Therefore simulated data is necessary to show and 
compare the performance of step detection algorithms. 
We generate noisy steps in two stages as outlined in 
Fig.(16). First, we generate a piecewise constant sig¬ 


nals according to a simplified version of the linear ratchet 
model of Pol II [65] . This model contains elongation and 
backtracked states and reproduces the ability to pause 
[401111], but does not accurately reflect the temporal or¬ 
der of translocation and other enzymatic processes. Dur¬ 
ing elongation, Ibp forward steps are generated with an 
effective rate of keiong- This effective rate includes the 
process of translocation, NTP insertion and pyrophos¬ 
phate release. In our model catalysis, bond formation 
and PPi release are summarized by a rate Further¬ 
more, the NTP-binding net rate is k 2 = cntp * k^/Ko 
and the translocation net rate /ci = • k 2 /{k- + k 2 ). 

cntp is the NTP concentration, Kj:f = 9.2/iM the disso¬ 
ciation constant, = SSHz is the forward translocation 
rate of Pol II and k- = 6S0Hz backward translocation. 
The values of these constants are known from experi¬ 
ments [65]. The elongation rate is then determined by 

kelong = (1/^1 + 1/^2 + 1/fe) 

With a rate of ki,i = bHz the motor makes a backward 
step of identical size as the forward step and thus enters 
the backtracked state. The enzyme can further backtrack 
by a rate k^ = 1.3Hz or return to the original state with 
a rate kf = 1.3Hz (figure [Tt]) . 

The rates corresponding to a forward step (/c+, kf) or 
backward step k^, k- ) are modified under external 
forces according to/c(F) = k{0) • exp{±F• 0.17nm/{ki,T))^ 
where = 4.IIpA^ • nm and the plus sign in the expo¬ 
nent applies to rates of forward steps. Simulations were 
computed for an assisting force of d.bpN. At this force 
forward and backward diffusion rates are ki, = S.SHz^ 
ki) = 1.0 Hz and kf = 1.1 Hz^ in accordance with the ki¬ 
netic model. For numerical simulation purposes the rates 
above are divided by the simulation’s time increment to 
yield dimensionless quantities. 

The transitions between elongation and backtracked 
states are generated using the Gillespie stochastic sim¬ 
ulation scheme [56] for a single enzyme. Dwell times are 
sampled from an exponential distribution according to 
the respective rates. 

In a second step, we simulated experimental noise in¬ 
cluding effects of confined brownian motion of trapped 


micro spheres. To accurately reflect the experiment, we 
take into account changes in the tether length and in the 
tether stiffness due to motion of the enzyme. We apply a 
harmonic description of the trapping potentials and as¬ 
sume that the DNA linker can be described by a spring 
constant koNA determined by the worm like chain model 

m- 

To formulate the equation of motion of two trapped micro 
spheres tethered by DNA we choose the coordinate sys¬ 
tem such that the enzyme moves in x-direction. Further¬ 
more we assure that drag coefficients 7^ and the trapping 
stiffness ktrap,i are identical in both traps. With this the 
effective DNA length x can be described by the following 
equation. 


jx = —k • X P Frit) 


(38) 


where k = ktrap + ‘^koNA^ koNA is the DNA stiffness 
and 7 is the drag coefficient. Frik) is the thermal force 
which is treated as gaussian white noise: {FT{t)) = 0 and 
{FT{t)FT{t')) = 2kBTj6{t — t'). Eq.([3^ describes a so 
called Ornstein-Uhlenbeck process and can be solved and 
simulated by standard techniques of stochastic differen¬ 
tial equations m which is shown in the next subsection. 
Eq.(38) was derived for the static situation without po¬ 


sitional changes. However, a molecular motor which is 
attached between micro spheres by a DNA-tether will 
change the tether length during its activity. Thus, koNA 
is also changing and can be computed using the worm¬ 
like chain model [58] . 

In the simulations we use a trap stiffness of ktrap = 
0.2bpN/nm^ a drag coefficient of 7 = 0.8 • 10~^pN • sjnm 
corresponding to beads with 850nm diameter and an ini¬ 
tial length of L = 3kbp for the DNA tether. 

We simulated a slow, an intermediate and a fast sce¬ 
nario which differ by stepping speed, sampling frequency, 
number of data points and noise amplitudes. Sampling 
frequencies and number of data points of the slow sce¬ 
nario are / = bkHz and N = 2.5-10^ points, for the inter¬ 
mediate scenario: / = 2kHz and N = 10^ points and for 
the fast scenario: / = IkHz and N = 5-10^. The elonga¬ 
tion rate keiong of tho slow scenario k^iong = 4:.lHz can 
be expected at a NTP concentration of cntp = ImM. 
Since backtracking becomes more likely at these NTP 
concentrations we limited analysis to simulated data that 
shows a net forward translocation. This excludes analysis 
of simulated data which exhibits only backtracked states. 
Elongation rates of the intermediate {keiong = 9.1Hz) 
and fast scenario {keiong = 2b.SHz) are expected at 
Cntp = 20mM and cntp = lOOOmM respectively. The 
standard deviation of noise amplitudes are directly com¬ 
puted from the noisy input data. This is done by sub¬ 
tracting the simulated step signal from the noisy steps 
and computing the standard deviation of the remain¬ 
ing signal. In both scenarios, slow and intermediate, 
the computed standard deviation is 5.5bp at the given 
sampling frequency. Eor the fast scenario we choose 
N = b ' 10^ data points and IkHz sampling rate. More¬ 
over in the fast scenario we use higher noise amplitudes 
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stepping 

parameters 



noise 

parameter 



simulation of noise 


FIG. 16: Step and noise simulation procedure. State 
transition model and corresponding stepping rates 
determine the probability distribution from which a 
piecewise constant signals (first inset) is sampled. In a 
second step noise is simulated with the piecewise 
constant signals as the mean (second inset). 


The equation of motion of such a system of reads m-- 
7x = —Kx + FT(t) (39) 


where x = (xi, X2) is the x-coordinate of first and second 
bead. Furthermore drag coefficient, stiffness and thermal 
force are: 


7 = 



ktrap,l + koNA —^DNA 

. —koNA ktrap,2 + koNAj 


Frit) 


( FTA{t)\ 

\FT,2{t) J 


The thermal force fulfills the gaussian white noise proper¬ 
ties: = 0 and {FT,i{t)FTj{t')) = 2kBT^6ij6{t-t') 

The relative coordinate x = X 2 — xi which will be called 
X in the following can be simplified by assuming that 

7l — 72 — 7 kfrpQ^p i — ^trap,2 — I®. 


^][Ki LI 

^ ibp 

<>♦00 

hit kb 

FIG. 17: Simplified stepping model of RNAP II with an 
elongation and backtracked states. 


with a computed standard deviation of 10.06p at the 
Ik Hz sampling frequency. 

Finally, for all three scenarios 25 data sets were simu¬ 
lated and analyzed. Table |I] gives an overview over the 
simulation parameters. 


Simulating beads in a harmonic optical trap 


As described above we account for confined brownian 
motion of trapped beads in a dual trap optical tweezers. 
A harmonic description of trapping potentials is applied 
and we assume the DNA linker can be described by a 
WLG model with a spring constant k^NA- In the fol¬ 
lowing we briefly show the derivation of eq.(38) and its 
solution. We focus on the x-coordinates of two beads 
trapped in different optical traps and tethered by DNA. 


X = —27r/c • X H— Ft 

7 


(40) 


Where fc = {ktrap F ^k^NA )is the corner fre¬ 
quency of the system. koNA = kDNA{F, L) depends on 
force and length of the DNA tether and is calculated from 
the wormlike chain model m- During enzyme stepping 
koNA has to be updated repeatedly with respect to the 
external parameters force F and length L. Eq.([4Q|) de¬ 
scribes a so called Ornstein-Uhlenbeck process (OU) and 
can be solved and simulated by standard techniques of 
stochastic differential equations m- From Eq. ( [4Q| ) it can 
be seen that for timescales slower than the corner fre¬ 
quency fc noise behaves essentially as white noise. Eor 
faster timescales than fc noise rather has characteristics 
of brownian motion. In the following, the simulation of 
an Ornstein-Uhlenbeck process is described. We rewrite 
Eq.(40) in Ito form: 


dxt = —k ' X ' dt F \/2D • dWt (41) 

where k = {n F ‘^k^NA) D = ksT/^ is the diffu¬ 
sion constant and dWt infinitesimally describes brown¬ 
ian motion. Eor a finite time interval AWt = dWt' 
describes a standard normal distributed random variable 
Af{0^hF with standard deviation a = \/h. 

Eq. ( [4l| just describes a gaussian random variable with 
mean fi and variance [62] : 


Xt e Afin, a'^) = Af (^Xt-ie , (42) 

and a random path can be straightforwardly simulated 
starting from an initial position xq. 
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TABLE I: Overview of simulation parameters. Shown is elongation rate keiong^ corresponding NTP concentration 
cntp and rate constants of the backtracking state. Moreover, the standard deviation of noise cr^, sampling 

frequency / and number of data points N is given. 


scenario: keionglHz CNTp/mM khilHz khjHz kfjHz cfnlhp fjkHz N 


slow 

4.1 

7 

3.8 

1.0 

1.7 

- 6 

5 

2.5 • 10' 

intermediate 

9.1 

20 

2.3 

1.0 

1.7 

- 6 

2 

1 • 10® 

fast 

25.8 

1000 

2.3 

1.0 

1.7 

- 10 

1 
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FIG. 18: Histogram of step sizes of the slow scenario 


18a| and the intermediate scenario [l8b| for t-test, HMM, 
K & V and EBS (from upper to lower histogram). 


Details of algorithm comparison 


To achieve best results for the three simulation 
scenarios we need to tune the external parameters of 
the t-test, the HMM and the EBS algorithm. While the 
latter is described in the main text (only for the slow 
scenario we used ps = 1.5 instead of ps = 2) we will 
briefly explain how to adapt the other two algorithms to 
yield as many correct steps as possible but also to have 


a large fraction of correct steps among the found steps. 
For the t-test a minimum step size of 0.3nm and a 
shortest dwell of lOms was used. Moreover, the t-test 
threshold was 0.01, the binomial threshold 0.005 and the 
maximal number of iterations was 100. 

The HMM analysis was conducted with maximally 
100 iterations for maximum likelihood estimation of 
transition probabilities. More iterations did not give 
better results and fewer iterations (< 10) could not 
optimize the log-likelihood properly (data not shown). 
For the slow scenario 85 states were used and for the 
intermediate and fast scenario we used 140. To prevent 
memory overflow in the intermediate scenario, we per¬ 
formed box car averaging to reduce the number of data 
points by a factor of two. Furthermore, a grid spacing 
of 1/2 was used which proved to be better than a Ihp 
spacing. Since the HMM level grid has to be aligned by 
using noisy data as an input, a two times smaller grid 
spacing showed better results. In contrast the situation 
is advantegous in case of combinatorial clustering which 
constructs the level grid on already denoised data. 

The algorithm from Kalafut and Visscher has the great 
advantage that it works completely without parameters 
that have to be adjusted by the user. 

To complete the performance results of the algorithm 
comparison (figure the step size histograms of the 
step detection result for easy and intermediate scenario 
are given, fig.(p^. For slow stepping rates the step size 
histogra ms re semble the simulated step size ±.lhp quite 
well, fig. (18a). However, a deviation from the Ihp steps 
can be already seen in the i ntermediate scenario for the 
t-test and HMM, fig.( 18b). For EBS the majority of 
the detected steps are Ihp in size in both scenarios. The 
more difficult situation in the intermediate scena rio is 
reflected by the larger fraction of 2hp steps (figure mi 
red). 


Table [TT| summarizes computational speed and memory 
consumption of the algorithms for test runs with 100s of 
temporal length and rate constants according to the in¬ 
termediate scenario. We simulated data containing ^ 900 
simulated steps and successively increased the number of 
data points by increasing the sampling frequency. For 
each sampling frequency the standard deviation of noise 
was constant. EBS processed 2 • 10^ data points, sim¬ 
ulated with 200kHz sampling frequency in Amin. The 
other algorithms had comparably long run times and we 
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restricted computation times to ^ 50mm and memory 
consumption to a limit of 2.38GB. EBS can process 
much more data points at comparably short run time and 
is essentially limited only by the size of available mem¬ 
ory. The high bandwidth signals can be compressed very 
well by TVDN to a few thousand plateaus, which yields 
shorter run times for the subsequent CC. In contrast the 
t-test becomes slower at > 10^ data points since it has 
more possibilities to adapt window sizes. A limiting fac¬ 
tor in case of a lot of data points for the HMM beside 
processing speed is memory consumption of the Viterbi 
reconstruction. Taken together the more efficient compu¬ 
tation of EBS compared to the other algorithms allows 
for the analysis of high bandwidth data. This in turn 
increases can increase the performance of step-finding. 
To show how much the performance of step detection im¬ 
proves we use the intermediate scenario but increase the 
bandwidth (i.e. number of data points) from 2kHz to 
200kHz. In order to have the same standard deviation 
of noise at 2kHz sampling rate, the noise amplitude of 
the high bandwidth signal is increased accordingly. This 
increases precision and recall from 30% and ^ 60%) 
at the lower bandwidth to 40% and ^ 60%) at the 
higher bandwidth. Eor CC the same set of parameters is 
used for low and high bandwidth signal (methods). Due 
to the very fast denoising stage and the efficient compres¬ 
sion to tuples run times are still below 3min for 10^ data 
points and 200 — 300 steps. 



EIG. 19: Cumulative fraction of found steps for the 
EBS in the intermediate scenario. Plotted is the 
cumulative number of found steps/simulated steps of 
the signal plotted in Eig. 0 for different dwell times. 
The number of detected steps compared to simulated 
ones is smaller for short dwell time steps. Dwell time 
histograms with a binning of 81.3ms were determined 
for the detected and simulated steps respectively. The 
number of detected steps for each dwell time was 
divided by the corresponding number of simulated steps 
and cumulatively summed up. In total 60.5% of the 
number of simulated steps were found. 


TABLE II: Comparison of computation efficiency of the 
different step-finding algorithms. ^ 900 simulated steps 
on commodity hardware (i7-2600, 3.0GHz CPU Ubuntu 
System, AGB memory). Corresponding run times were 
recorded in matlab for the signal of the given size. Peak 
memory usage, i.e. resident set size (RSS) was 
measured with Linux’s proc information system. 



t-test 

HMM 

K & V 

EBS 

data points/10^ 

2.5 

2.5 

2.5 

2.5 

run time/s 

708 

4858 

2555 

5 

peak R88/GB 

0.17 

2.38 

0.14 

0.15 


Remarks on Example of TVDN and combinatorial 
clustering 


In order to get temporal information of the missing 
steps in the example given in the Results & Discussion 
section, fig. we compare the dwell time histograms 
of the simulated and detected steps. The cumulative 
fraction of found steps for a certain dwell time shows 
that steps with short dwell times are omitted with higher 
probability, fig. (19). 


Effect of prior information in combinatorial 
clustering 


To analyze the impact of the prior terms and level grid 
spacing on step detection quality we performed CC with 
differe nt pr ior potential strength and level grid spacing 
(figure |2Qa| ) . We varied the prior regularization param¬ 
eters ps and pp starting from = 0 to a maximum 
of Ps = 6, while the jump-height prior parameter was 
varied simultaneously such that ppjps — 12.5 remained 
constant. By increasing the prior regularization param¬ 
eters the preci sion of step detection can be increased 
(triangles, fig. poi] )). Eurthermore, precision can be 
improved by choosing a level grid with a sp acing of the 
simulated step size of Ibp (squares, fig.(20a)). 

The best result regarding the absolute number of correct 
steps and a small number of falsely detected steps which 
lie outside a certain time window around a n act ual step 
(Methods) was found for ps = 4, pp = 50 ( |20a[ ). 

This increases the number of correct steps from 34% of 
the steps found at vanishing prior potential to 48% for 
the l/Ahp level grid analysis (supplementary material). 
By increasing the spacing to Ibp the detection precision 
can be improved even further to a fraction of 54% 
correct steps of the found steps. 

The prior potential strength of the smoothing term ps 
can not be arbitrarily large since it would remove steps 
in favor of fewer large steps and thus reduces the number 
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of correctly recovered simulated steps. The variation of 
the prior term also effects step size histograms (figure 


20c). When no prior terms are present {ps = pp = 0) 


the detected step-size is oft entim es smaller than the 
simulated step-height (figure 20c upper panel). Opti¬ 
mization of the regularization parameters as well as an 
increase in level spacing improves the precision of the 
EBS algorithm as sh own in the histograms of detected 
step-sizes (figure [20^ middle and lower panel). 


EBS application to experimental data of 9^29 
bacteriophage 


The noisy 2.bkHz recording of a (p29 bacteriophage 
(figure has the characteristic dependency of the num¬ 
ber of denoise d step s on the A regularization parameter of 
TVDN (figure [211^ and algorithm finds the regulariza¬ 
tion parameter for optimal den oisin g, X^. Based on the 
TVDN result (red signal, figure [2Ta| ) and the prior infor¬ 
mation, that the (p29 bacteriophage performs snbst eps o f 
2.bbp [12], a level grid is formed (black lines, figure 21a). 
After combinatorial clusteri ng th e detected step signal is 
obtained (blue signal, figure [2Ta| ) . 


Application of EBS to find pauses in experimental 
transcription data 


In the following we discuss the determination of pauses 
in experimental Pol II data as an example of further post 
processing of the detected steps and compare EBS based 
pause finding and SGVT on simulated data. 

Eor the simulated Pol II steps dwell times are assigned 
to a pause when they lead to a backward step. The cor¬ 
responding pause ends when a forward step brings Pol 
II back to the elongation state. Eor the detected steps 
this criterion also applies, however unlikely long dwells 
are also considered as pauses, since the algorithm will 
not perfectly find all steps present. Given the limited 
bandwidth (IkHz), high speed (saturating NTP con¬ 
centration) of the enzyme and noise (standard deviation 
^ lObp) in the traces step detection performance should 
be similar to the fast scenario in our algorithm compari¬ 
son. One can expect that mostly very fast steps are lost 
(figure [l^ , i.e. fused to large steps. On the other hand 
that means that also short backtracks are likely to be 
skipped and instead a longer dwell time between two for¬ 
ward steps is returned by the algorithm. Nevertheless, 
these longer dwells can be identified based on statistical 
hypothesis testing. Assuming that dwell times of for¬ 
ward stepping {r for ward) follow an exponential waiting 
time distribution, we calculate the mean dwell time of 
forward steps to estimate the probability distribution. 


N 


{'^elongation) ^ {'^forward) ^ ^ '^forward 


i=l 


Since not all backtracked pauses are discovered, this es¬ 
timate of (Teiongation) also Contains longer dwell times 
at a skipped pause. Thus {rforward) can be larger than 
{'^'elongation) ^ud should be taken as an upper bound for 
the actual mean waiting time. 

Eurthermore we assume that forward steps obey an ex¬ 
ponential distribution of the following form: 

p{t) = jk-exp{-T/{T)) (44) 

iv 

Under these assumptions we can define a confidence 
level to discriminate between normal dwell times of 
elongation and unlikely long dwell times which are 
caused by undetected backtracks. 

The confidence level can be adjusted by comparing 
recovered pauses to simulated backtracked pauses. A 
good compromise is found when most of the pauses are 
recovered and none or only very few of them are wrongly 
found. 

To this end we simulated 10 data sets with stepping 
rates and sampling frequency of the fast scenario and 
a computed noise amplitude of ^ 6bp. The simulated 
data is processed by EBS and the paused regions are 
identified according to the criterion described above. We 
also identify paused regions by SGVT with a threshold of 
two standard deviations of the pause peak as described 
in the methods section. SGVT sometimes returns very 
short pauses which are not related to simulated ones 
and are presumably caused by high noise affecting the 
filtered signal. Thus we exclude pauses smaller than 
10ms in the SGVT analysis. Pauses found by EBS were 
always larger than 10ms and thus there was no need 
for such an additional post-processing step. Eor each 
detected pause we identify if it is a correctly found one 
by checking whether it coincides with a simulated pause. 
We also take into account that, either two detected 
pauses which are close but separated could overlap 
with a simulated pause, or that a single detected pause 
could cover two very close but separated simulated 
ones. Having identified how many pauses are correct, 
we can compute the recall (i.e. the number of correctly 
found pauses divided by the number of simulated 
pauses), the precision (i.e. the number of correctly found 
pauses divided by the number of found pauses) and 
the false discovery rate (EDR, i.e. number of wrongly 
found pauses divided by the number of found pauses). 
Moreover we compare the total cumulated length of 
all detected pauses to the total cumulated length of 
simulated ones. This value is relevant since a correct 
determination of the total length of pauses is important 
for determining pause-free velocities which are computed 
by excluding the paused intervals from the measured 
data. Table imi shows mean and standard deviation 
of recall, precision, EDR and total length for long 
(t > tp) and short pauses (t < tp) where the threshold 
for determining a long pause is tp = 0.8<s (Results & 
Discussion). Although in the fast scenario step detection 
performance is inappropriate for further dwell time anal- 
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FIG. 20: Prior terms and level grid regularize combinatorial clustering. (20a) Relative frequencies of correct steps 


among the number of detected steps (precision) as a function of prior potential strength 1/ps {ps/pp = 0.08 is kept 
constant) for simulated data using the intermediate scenario. Shown is the precision for clustering with a level grid 
of l/Abp spacing (black triangles) and with a spacing of Ibp (grey squares). ( |20c[ ) step size histograms of detected 
steps with a label grid of l/Abp without prior terms (blue), with a spacing of 1/Abp and prior terms (green) and with 
a spacing of Ibp and prior terms of the same strength (red). The computed prec ision corresponding to the three 

histograms is encircled with the respective color, Fig. (20a). 



FIG. 21: Detected steps and interm ediat e results of EBS application to the (p29 example (Results & Discu ssion). 
Shown is the noisy data (grey circles, |2Ta|), the correspond ing A -characteristic curve (green curve. Fig. ini^, T VDN 
with respect to optimal Xh (black b ox, |21bD an d red signal, level grid for GG (black horizontal lines, |21^ ) and 

steps detected by EBS (blue signal, 21a ))” 21c shows the sum of detected step sizes of a set of 40 p29 measurements. 


ysis, finding pauses still works well, fig. (22) and table III 
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TABLE III: Detection of short and long backtracks in 
simulated data by EBS and Savitzky-Golay (SG) filter. 
Shown is the number of correctly detected backtracks 
divided by the number of simulated backtracks (recall), 
the number of correctly detected backtracks divided by 
the number of found backtracked regions (precision) 
and the false discovery rate (EDR, number of false 
positives divided by number of found backtracked 
regions). Moreover, the total length of detected 
backtacks divided by the total length of simulated 
backtracks is given. Backtracks with a detected 
duration < 10ms were excluded. The uncertainties for 
recall, precision and EDR are SEM. 


recall/% precision/% FDR/% total length/% 

short pauses: 

SG filter 38 ± 7 57 ± 7 43 ± 8 70 

EBS _ 61zb4 98zb2 8zb2 _ 91 

long pauses: 

SG filter 98 ± 1 100 0 94 

EBS 100 100 0 113 



time/s 

EIG. 22: Backtracked pause detection in simulated 
data. Shown is the noisy input signal (circles), the 
simulated step signal (blue), the Savitzky-Golay filtered 
signal (black) and the detected step signal from EBS 
(red). Pauses in simulated data are highlighted in green 
and paused regions in step detected data are indicated 
by the blue shaded areas. 
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